
Conversation

radu-mocanu (Contributor) commented Oct 13, 2025

Done:

  • added new evaluators
  • added support for the new schema
  • wired up platform evaluators
  • added support for custom evaluators

Tip: check the calculator sample

ToDo:

  • rename from coded to legacy everywhere

How to test:
the example includes the following evaluator spec (for the `uipath-contains` evaluator):

{
  "version": "1.0",
  "id": "DenialCodeContains",
  "description": "Checks if the response text includes the expected denial code.",
  "evaluatorTypeId": "uipath-contains",
  "evaluatorConfig": {
    "name": "Denial Code Contains",
    "targetOutputKey": "report",
    "negated": true,
    "ignoreCase": false,
    "defaultEvaluationCriteria": {
      "searchText": "mock report"
    }
  }
}
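
For intuition, here is a minimal sketch of the semantics this config implies (hypothetical helper, not the SDK's actual `uipath-contains` implementation):

```python
def contains_check(output: dict, config: dict, criteria: dict | None) -> bool:
    """Sketch of the behavior implied by the uipath-contains config above."""
    # Fall back to the evaluator's default criteria when none are supplied.
    criteria = criteria or config["defaultEvaluationCriteria"]
    haystack = str(output[config["targetOutputKey"]])
    needle = criteria["searchText"]
    if config.get("ignoreCase"):
        haystack, needle = haystack.lower(), needle.lower()
    found = needle in haystack
    # With "negated": true, the check passes when the text is NOT found.
    return not found if config.get("negated") else found
```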

example eval set:

{
  "version": "1.0",
  "id": "ClaimDenialReview",
  "name": "Claim Denial Review",
  "evaluatorRefs": [
    "DenialCodeContains"
  ],
  "evaluations": [
    {
      "id": "denial-default",
      "name": "Respond with default denial code",
      "inputs": {
        "human_input": "Customer asks for the denial code on claim XFC-01."
      },
      "evaluationCriterias": {
        "DenialCodeContains": null
      }
    },
    {
      "id": "denial-override",
      "name": "Respond with override denial code",
      "inputs": {
        "human_input": "Customer asks if claim XFC-02 was denied and why."
      },
      "evaluationCriterias": {
        "DenialCodeContains": {
          "searchText": "mock sreport 1234 lala"
        }
      }
    },
    {
      "id": "denial-skip",
      "name": "Skip denial code check",
      "inputs": {
        "human_input": "Customer checks status of claim XFC-03 with no denial expected."
      },
      "evaluationCriterias": {}
    }
  ]
}
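
The three evaluations above suggest how per-evaluation criteria resolve against the evaluator's defaults. A sketch of that implied resolution (hypothetical helper, not the SDK's code):

```python
def resolve_criteria(evaluation: dict, evaluator_id: str, default: dict) -> dict | None:
    """Missing key -> skip evaluator; null -> use defaults; object -> override."""
    criterias = evaluation.get("evaluationCriterias", {})
    if evaluator_id not in criterias:
        return None  # "denial-skip": evaluator not applied
    override = criterias[evaluator_id]
    # "denial-default" passes null, so the evaluator's defaultEvaluationCriteria
    # applies; "denial-override" supplies its own searchText.
    return default if override is None else override
```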

Development Package

  • Add this package as a dependency in your pyproject.toml:
[project]
dependencies = [
  # Choose one of the following:
  # Exact version:
  "uipath==2.1.108.dev1006852164",
  # Or any version from this PR:
  # "uipath>=2.1.108.dev1006850000,<2.1.108.dev1006860000",
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

radu-mocanu force-pushed the release/revamped-evals branch from a3e9908 to 333821b on October 13, 2025 14:46
mjnovice and others added 9 commits October 14, 2025 16:21
feat: wiring ExactMatch evaluator to new schema
feat: wiring JsonSimilarity evaluator to new schema
- Add version property detection to distinguish coded-evals from legacy files
- Update pull command to map coded-evals folder to local evals structure
- Update push command to upload files with version property to coded-evals folder
- Maintain backward compatibility with legacy evals folder structure
- Ensure eval command works out of the box with existing structure

fix: resolve eval set path for correct evaluator discovery

- Update load_eval_set to return both evaluation set and resolved path
- Fix evaluator discovery by using resolved path instead of original path
- Ensure eval command works with files in evals/eval-sets/ and evals/evaluators/

fix: cleaning up files

fix: address PR review comments

1. Move eval_set path resolution from runtime to CLI layer
   - Resolve path in cli_eval.py before creating runtime
   - Remove context update in runtime since path is already resolved
   - Better separation of concerns

2. Clarify directory structure comments
   - Make it explicit that os.path.join produces {self.directory}/evals/evaluators/
   - Prevent confusion about directory paths

3. Add file deletion consistency for evaluation files
   - Delete remote evaluation files when deleted locally
   - Matches behavior of source file handling
   - Ensures consistency across all file types

Addresses: #681 (review)
Addresses: #681 (review)
Addresses: #681 (comment)
feat: adding pull and push for coded-evals folder files
feat: wiring LLM judge evaluators to new schema
Chibionos changed the title from "Release/revamped evals" to "feat: new eval schema changes" Oct 16, 2025
Chibionos requested a review from akshaylive October 16, 2025 12:16
andrei-rusu and others added 13 commits October 16, 2025 18:23
…_evals

feat: Cherry-pick progress on parallelization of eval runs
feat: missing changes from llm eval wiring
Add a dedicated environment variable for eval endpoint routing, with environment-aware
localhost detection using proper URL parsing. This avoids false positives and any impact
on other services that use UIPATH_URL.

Added `UIPATH_EVAL_BACKEND_URL` for eval-specific routing:
- Set to localhost URL (e.g., `http://localhost:8080`) for local development
- Leave unset or set to production URL for alpha/production environments
- Isolates eval endpoint routing from UIPATH_URL used by other services

**New Constant:**
- `ENV_EVAL_BACKEND_URL = "UIPATH_EVAL_BACKEND_URL"` in `constants.py`

**Updated Helper Method with Robust URL Parsing:**
- `_get_endpoint_prefix()` uses `urllib.parse.urlparse()` for accurate hostname detection
- Checks parsed hostname specifically (not substring matching)
- Prevents false positives like "notlocalhost.com" or "127.0.0.1.example.com"
- Returns `""` (empty) only when hostname is exactly "localhost" or "127.0.0.1"
- Returns `"agentsruntime_/"` for all other cases (including unset or parse failures)
- Handles edge cases: case-insensitive matching, ports, protocols

All 4 progress reporting endpoints now use `/coded/` path with conditional routing:

| Method | Endpoint Pattern |
|--------|------------------|
| PUT evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` |
| POST evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` |
| POST evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` |
| PUT evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` |

Where `{prefix}` is determined by `_get_endpoint_prefix()` (sketched below):
- Localhost: `""` (empty - direct API access)
- Alpha/Prod: `"agentsruntime_/"` (service routing)
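
A sketch consistent with the description above (the real helper lives in `_progress_reporter.py`; this is illustrative, not the exact source):

```python
import os
from urllib.parse import urlparse

def _get_endpoint_prefix() -> str:
    """Return "" for local eval backends, "agentsruntime_/" otherwise."""
    url = os.environ.get("UIPATH_EVAL_BACKEND_URL", "")
    try:
        hostname = urlparse(url).hostname or ""
    except ValueError:
        return "agentsruntime_/"  # graceful fallback on parse failures
    if hostname.lower() in ("localhost", "127.0.0.1"):
        return ""  # direct API access for local development
    return "agentsruntime_/"  # service routing for alpha/production
```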

- `_update_eval_run_spec()` - Update eval run with results
- `_create_eval_run_spec()` - Create new eval run
- `_create_eval_set_run_spec()` - Create new eval set run
- `_update_eval_set_run_spec()` - Update eval set run completion

Local development:

```bash
export UIPATH_EVAL_BACKEND_URL=http://localhost:8080
```

Alpha/production (or leave unset for the default routing):

```bash
export UIPATH_EVAL_BACKEND_URL=https://alpha.uipath.com
```

✅ **Isolated Configuration:**
- Eval routing independent of UIPATH_URL
- Other services using UIPATH_URL remain unaffected
- Enables local eval testing without affecting other components

✅ **Robust URL Parsing:**
- Uses `urllib.parse.urlparse()` for accurate hostname extraction
- Prevents false positives from substring matching (e.g., "notlocalhost.com")
- Handles edge cases: ports, protocols, case sensitivity
- Graceful fallback on parsing errors

✅ **Simple & Explicit:**
- Single environment variable controls all eval endpoint routing
- Clear localhost detection (exact hostname match)
- Defaults to production routing when unset (safe fallback)

✅ **Backward Compatible:**
- No breaking changes to existing deployments
- Defaults to `agentsruntime_/` prefix when env not set
- Supports new `/coded/` evaluator API endpoints

✅ **Validation:**
- Syntax validation: Passed
- Logic verification: Passed (13 test scenarios including edge cases)
- Files changed: 2 modified (+29, -5 lines)

✅ **Environment Detection Tests:**

**Standard Cases:**
- `http://localhost:8080` → ✓ Empty prefix
- `http://127.0.0.1:3000` → ✓ Empty prefix
- `https://localhost` → ✓ Empty prefix
- `https://alpha.uipath.com` → ✓ `agentsruntime_/` prefix
- `https://cloud.uipath.com` → ✓ `agentsruntime_/` prefix
- Unset/empty → ✓ `agentsruntime_/` prefix (default)

**Edge Cases (False Positive Prevention):**
- `https://notlocalhost.com` → ✓ `agentsruntime_/` prefix (not localhost)
- `https://127.0.0.1.example.com` → ✓ `agentsruntime_/` prefix (not localhost)
- `https://mylocalhost.io` → ✓ `agentsruntime_/` prefix (not localhost)

**Case Sensitivity:**
- `http://LOCALHOST:8080` → ✓ Empty prefix (case-insensitive)
- `http://LocalHost:8080` → ✓ Empty prefix (case-insensitive)

**Problem:** Simple substring check `"localhost" in url` could match `"notlocalhost.com"`
**Solution:** Use `urlparse()` to extract exact hostname, preventing false positives
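
These scenarios map naturally onto a parametrized test; a sketch against the helper sketched earlier (assumes it is importable into the test module):

```python
import pytest

@pytest.mark.parametrize(("url", "expected"), [
    ("http://localhost:8080", ""),
    ("http://127.0.0.1:3000", ""),
    ("http://LOCALHOST:8080", ""),                        # case-insensitive
    ("https://alpha.uipath.com", "agentsruntime_/"),
    ("https://notlocalhost.com", "agentsruntime_/"),      # no substring match
    ("https://127.0.0.1.example.com", "agentsruntime_/"),
    ("", "agentsruntime_/"),                              # unset -> safe default
])
def test_endpoint_prefix(monkeypatch, url, expected):
    monkeypatch.setenv("UIPATH_EVAL_BACKEND_URL", url)
    assert _get_endpoint_prefix() == expected
```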

**Status:** Verified - documentation correctly references `UIPATH_EVAL_BACKEND_URL` throughout

- `src/uipath/_cli/_evals/_progress_reporter.py` (+29, -5)
  - Added `urllib.parse.urlparse` import
  - Improved `_get_endpoint_prefix()` with URL parsing logic
- `src/uipath/_utils/constants.py` (+1)
  - Added `ENV_EVAL_BACKEND_URL` constant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
feat: wire tool evals, add mocked tool sample agent
refac: coded_evalutors -> evaluators, associated renames/changes
feat(evals): add dedicated UIPATH_EVAL_BACKEND_URL for localhost routing
feat: add support for custom evaluators
akshaylive (Collaborator) left a comment:

TBH this PR is way too big to review in one shot. If the individual PRs have already been approved, please merge it as-is and address my comments in a subsequent PR.

assertion_runs, evaluator_scores = self._collect_results(
    sw_progress_item.eval_results, evaluators, spans or []  # type: ignore
)
spec = self._update_eval_run_spec(
Collaborator:

Nit:

evaluator_runs, evaluator_scores = ..

This way, you can remove the duplicate `spec = ...` lines.

)


class LegacyEvaluationItem(BaseModel):
Collaborator:
Please extend EvaluationItem. We can also get rid of AnyEvaluationItem.

self.evaluations = selected_evals


class LegacyEvaluationSet(BaseModel):
Collaborator:
Please extend EvaluationSet.

"""
if not evaluators:
return False
# Check the first evaluator type
Collaborator:
Why only first?

no_of_evals: int
evaluators: List[Any]
# skip validation to avoid abstract class instantiation
evaluators: SkipValidation[List[AnyEvaluator]]
Collaborator:
EvalSetRunCreatedEvent should not be a BaseModel IMO. Please switch to @dataclass instead.
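
A minimal sketch of the suggested change, using the fields visible in the excerpt above:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class EvalSetRunCreatedEvent:
    # A plain dataclass skips pydantic validation entirely, so the
    # SkipValidation workaround for abstract evaluator types is unnecessary.
    no_of_evals: int
    evaluators: List[Any]
```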

        )
        return assertion_runs, evaluator_scores_list

    def _collect_coded_results(
Collaborator:
Let's avoid duplicating code; there's high overlap between this and `_collect_results`.

Discriminator(_discriminate_eval_set),
]

AnyEvaluationItem = Union[EvaluationItem, LegacyEvaluationItem]
Collaborator:
Let's get rid of these two Any types.

Contributor:
Hmm, curious though: what downside do you see with this?

Collaborator:
My main issue is that LegacyEvaluationItem currently does not extend EvaluationItem, but it semantically should. Any* for these purposes is an anti-pattern IMO.
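
In code, the suggestion amounts to something like this (fields are hypothetical):

```python
from pydantic import BaseModel

class EvaluationItem(BaseModel):
    id: str
    name: str

class LegacyEvaluationItem(EvaluationItem):
    # Extending instead of duplicating lets the
    # AnyEvaluationItem = Union[EvaluationItem, LegacyEvaluationItem]
    # alias disappear: code can accept EvaluationItem everywhere.
    legacy_payload: dict | None = None
```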


        result.evaluation_time = execution_time
        return result


class BaseEvaluator(BaseModel, Generic[T, C, J], ABC):
Collaborator:
Please use @dataclass from dataclasses instead of pydantic since types are not serializable.

Collaborator:
Regardless, this base class is overly complex. Have we over-engineered this?

github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (triggers tests in the uipath-llamaindex-python repository) labels Oct 23, 2025
mjnovice force-pushed the release/revamped-evals branch 4 times, most recently from db9a265 to 7efe060 on October 23, 2025 10:40
mjnovice force-pushed the release/revamped-evals branch from 7efe060 to 73d8d30 on October 23, 2025 10:46
mjnovice force-pushed the release/revamped-evals branch from 73d8d30 to 602854e on October 23, 2025 14:44
akshaylive added the build:dev label (create a dev build from the PR) Oct 23, 2025
mjnovice force-pushed the release/revamped-evals branch from 35fc809 to 6cb6db5 on October 23, 2025 21:02
akshaylive and others added 3 commits October 23, 2025 17:01
These are required for build.
fix(TonOfFixes): lots of minor fixes
Model default is provided so updating the unit test.
@@ -0,0 +1 @@
print("to be implemented later")
ionmincu (Contributor) commented Oct 24, 2025:
don't forget about this
You can check the output file; see the logs:

Printing output file...
=== OUTPUT FILE CONTENT ===
{
  "output": {
    "evaluationSetName": "Weather Tools Agent Evaluation",
    "evaluationSetResults": [

Then assert the JSON matches a schema, or contains some known properties like `"evaluatorName": "ToolCallArgsEvaluator"` with a result and a score.
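
Concretely, the suggested assertion might look like this (file path and exact schema are assumptions based on the log excerpt above):

```python
import json

with open("output.json") as f:  # hypothetical output path
    output = json.load(f)["output"]

assert output["evaluationSetName"] == "Weather Tools Agent Evaluation"
results = output["evaluationSetResults"]
assert any(r.get("evaluatorName") == "ToolCallArgsEvaluator" for r in results)
assert all("score" in r for r in results)  # every result should carry a score
```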

radu-mocanu merged commit 3396560 into main Oct 24, 2025
61 of 62 checks passed
radu-mocanu deleted the release/revamped-evals branch October 24, 2025 16:14
